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Abstract 

This paper investigates a class of algorithms for numerical integration of a function 
in d dimensions over a compact domain by Monte Carlo methods. We construct a 
histogram approximation to the function using a partition of the integration domain 
into a set of bins specified by some parameters. We then consider two adaptations; 
the first is to subtract the histogram approximation, whose integral we may easily 
evaluate explicitly, from the function and integrate the difference using Monte Carlo; 
the second is to modify the bin parameters in order to make the variance of the Monte 
Carlo estimate of the integral the same for all bins. This allows us to use Student's 
t-test as a trigger for rebinning, which we claim is more stable than the x 2 test that 
is commonly used for this purpose. We provide a program that we have used to study 
the algorithm for the case where the histogram is represented as a product of one- 
dimensional histograms. We discuss the assumptions and approximations made, as 
well as giving a pedagogical discussion of the myriad ways in which the results of any 
such Monte Carlo integration program can be misleading. 



1 Introduction 

We are interested in evaluating the integral of a function / : M. d — > R over a compact domain. 
It is a simple matter to map any compact domain into the unit hypercube, so we need to 
evaluate ^ 

I=j dxf(x)= / dxf- I dx d f(xi,...,x d ). 
J[o,i] d Jo Jo 

I is just the average value of / over a uniform probability distribution which vanishes outside 
the unit hypercube, we denote this average value by (/). 

*R. Arthur@sms.ed.ac.uk 
t adk@ph.ed.ac.uk 
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Defining the sample average / over a set of N uniformly distributed random points 
{iW G [0,l] d ,i = l,...,d} to be 

1 - 

i=i 

the weak law of large numbers [lj allows the identification (/) = limjv->oo / assuming only 
that the integral exists. Strictly speaking the weak law of large numbers states that, with 
probability arbitrarily close to one, / will become arbitrarily close to (/) for sufficiently 
large N. 

The central limit theorem makes the stronger statement that the probability distribution 
of / tends to a Gaussian with mean (/) and variance V/N, 

(2) 




where the variance of the distribution of / values is 

v = ({f- </>) 2 ) = j oid dx (/(*) - </>) 2 = </ 2 > " </> 2 ' 

but it requires stronger assumptions that we shall discuss shortly. An unbiased estimate of 
the variance is given by 



where of course 



The estimate of the integral / is within one standard deviation a = yJV/N of the true value 
I = (f) about 68% of the time. 

Note that the error is proportional to 1/V~N independent of the dimension d of the 
integral. From this fact stems the great utility of Monte Carlo for integration in many di- 
mensions compared to numerical quadrature. Generally for numerical quadrature (trapezoid 
rule, Simpson's rule etc.) the error is 0(A k ) where A is the grid spacing and k is a small 
number. With a fixed budget of function evaluations, N, on a regular grid each axis must 
be divided into \f~N segments. So A oc N~ l t d and thus the error is 0(N- k ' d )\ therefore in 
dimension d > 2k the Monte Carlo error is smaller. An intuitive explanation for this is that 
a random sample is more homogeneous than a regular grid [2] . 



1.1 Singular Integrands 

If the integrand has a singularity within or on the boundary of the integration region extra 
care is required. Let us consider the proof of the central limit theorem. The probability 
distribution Pf for the values F = f(x) when x is chosen from the distribution P is 

P f (F) = J dxP{x)6(F-f(xj) (3) 
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for which 



J dFP f (F) = J dxP(x) = 1 and J dF P f (F) F = J dx P 



(x)f(x) = (f). 



We define the generating function for connected moments as the logarithm of the Fourier 
transform of Pf 



Assuming that (BJ can be expanded in an asymptotic series 



w f (k) = J2 



m=l 



ml 



where coefficients C m (cumulants) 

Co = 1 
C x 

C 2 



C 3 
c A 
c 5 



(/} 

V=((yf-(f)) 

f-(f)) 

/-(/)) 4 )-3C 2 

/-(/)) 5 ^-ioc 3 c 2 

/ - (/)) 6 ) - 15C 4 C 2 - IOC* - 15C 2 2 



(4) 



(5) 



are all finite. We consider the distribution function Pj for the sample average / defined in 
equation (JT]) 



p f -(F) = j dx {l) ■ ■ ■ dx {N) P(x (1) ) • • • P(x {N) ) S\^F-^J2f( 
and the corresponding generating function Wj 



x 



WAk) = In / dFP f {F)e 



JkF 



Since the points a?W were chosen independently (jHl) factorises to give 



WV(Jfe) = In 



dxP(x)e ikf{x)/N 



N 



NW f 



(6) 



expanding which gives an (asymptotic) expansion in powers of 1/N 



m=l x ' 

Ignoring terms of 0(N~ 2 ) and taking the inverse Fourier transform gives 



exp 

Pf(F) = 



-AEdM 

2V/N 



^J2tiV/N ' 

namely a Gaussian with mean (/) and variance V/N. 

However, if any of the moments C n are not finite then the expansion (jSJ) is not justified. A 
simple class of integrals with divergent higher moments is dx x a with a < 0. If— 1 < a < 
this integral is well-defined but has an infinite number of divergent moments. Following 
equation ^ 



l-e a(l-e) 

for y G [e a , 1], and zero elsewhere. The cumulants are combinations of the moments 

i _ l+an 

dyP(y)y n = - — - 

y (l-e)(l+na) 

which diverge as e — > if 1 + an < 0, or equivalently every cumulant C n with n > — 1/a 
diverges. What this means in practice is that although estimating such integrals by Monte 
Carlo is allowed — the weak law of large numbers assures us that as long as iV is large 
enough the estimate will converge but it gives no indication of how large iV should be — it 
is misleading to estimate the error from the variance alone. This is because the distribution 
of the estimates of the integral is not Gaussian even if the variance exists. In practice one 
obtains non-Gaussian distributions with "fat tails" , for some examples see Figure [TJ 

We may also observe that a singularity in the integrand does not necessarily lead to 
infinite cumulants: for the function f(x) = — Inx a similar analysis to that given above 
shows that Pf(F) = e~ F for F G [0, oo) and hence (f m ) = ml so all its cumulants are finite, 
and therefore the Monte Carlo estimates of the integral do have a Gaussian distribution as 
the number of samples iV — > oo. 

For all convergent integrals Monte Carlo provides an estimate of the mean. However, if 
any moment diverges the distribution of the means is not Gaussian, and in general even if 
the variance exists it gives an underestimate of the "width" of the probability distribution. 
This is often the case in practice, for example when evaluating Feynman parameter integrals 
which are integrable but not square integrable. If the Monte Carlo integrator is treated as 
a black box the quoted error from the standard deviation will often be an underestimate of 
the true error. With these provisos on the applicability of Monte Carlo integration we now 
turn to our main topic, a new method of adaptive Monte Carlo integration. 
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Figure 1: (a) J Q dxx~ a with a = § and (b) a = \ evaluated by Monte Carlo with 10,000 function 
evaluations. The distribution of results about the mean is distinctly non-Gaussian. For (b), which 
has worse divergences, there is more weight in the tail of the distribution. For (a) the variance 
exists but the distribution is not Gaussian due to the vanishing of the higher moments. 

2 Variance Reduction 

The Monte Carlo scheme of Section 1 (naive Monte Carlo) can be improved by variance 
reduction schemes [3]: the basic idea is to use some information about the integral in order 
to reduce the variance sample average. We describe two methods: importance sampling and 
subtraction. 

2.1 Importance Sampling 

Let p(x) be a probability distribution, normalised to one, that closely approximates our 
function over the interval. If we generate random points x chosen from the distribution p 
and construct the average (f(x)/p(x)) over these points then we have an estimate of the 
integral 



The value / of © does not depend on p, but the corresponding variance of OH]) does, so we 
may minimise the variance OH]) by varying the function p subject to the constraint that it is 
correctly normalised N(p) = J dx p(x) = 1. Performing the required functional differentia- 
tion with the Lagrange multiplier A 




(7) 



and the variance 





6[V(p) + XN(p)} 
8p{x) 
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we obtain the result 

1/(1)1 



fdx'\f(x')\- 

The corresponding value for the minimal variance is 

K P t = V( Popt ) = (J dx\f(x)\) - (J dxf(x)\ , (9) 
which vanishes if / has the same sign everywhere in the domain of integration. 

2.2 Subtraction 

In this case we have 

/ = J dxf(x) = J dx[f(x)-g(x)} + J dxg(x), (10) 
where the integral of g is known exactly: the corresponding variance is 

V(g)= [ dx[f(x)-g(x)} 2 . (11) 



As before the integral I of (flOj) is independent of g whereas the variance is not, so we may 
minimise the latter by varying g. This time there is no constraint to be imposecQ and we 
find 

and the optimal variance V opt = V(g opt ) = 0. We therefore choose g(x) such that g(x) is a 
good approximation to f(x). 

Both of these methods assume knowledge of the function that one may not have for 
complicated multidimensional integrals that arise in practice, thus it is necessary to find a 
way to construct p or g from a function treated as a black box. 

3 Adaptive Algorithms 

The usual way to construct p(x) is the method of [4], VEGAS. The key feature of this 
algorithm is that it is adaptive, that is, it automatically constructs a probability density 
that approximates |/(sc)|. 

Our analysis of the variance reduction in §2.11 assumed that the function is representable 
as a product of functions in each variable, although the correctness of the global Monte Carlo 
estimate of the integral does not depend upon this. Such a factorisation depends on the co- 
ordinate system, see [5]: in the following we will always work in Cartesian co-ordinates and 
assume that the function approximately factorises. 

Each axis is divided into a number of bins and we select a point with each coordinate 
lying equiprobably in any bin along the corresponding axis; the points are thus chosen to 

In practice g will be a piecewise constant approximation to /, but we do not impose this as a constraint. 
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lie in equiprobably in any of the boxes that are defined by the intersection of the bins. The 
probability distribution p in (J7|) is p{x) = l/\b x \ where \b x \ is the volume of the box b x 
containing x. To converge to the optimal grid the bins are resized so that each bin makes 
an equal contribution to the integral of 

The analysis in [I] assumes that the probability distribution p(x) = Y[t=i Pi( x i) factorises 
but does not explicitly assume that the integrand / does. In this case the variance is 



V( P ) 




m - 1 2 



and minimisation with respect to any particular pi subject to the normalisation constraint 
N(pi) = f dxiPi(xi) = 1 gives 



Pi, op 



[Xi OC 



\ 



/ 



This gives the optimal solution for pi when all the other pj are fixed, but does not immediately 
give the optimal solution for all the factors of p simultaneously. If we make the further 
assumption that f(x) = Yii=ifi{ x i) then the solution reduces to that which we found in 



2.1[ namely pi t 



opt 



x ; 



oc \fi(xi)\ and hence p opt (x) oc \f(x) 



pt 



We may identify some issues with this approach. Consider the optimal distribution p a 
that VEGAS is trying to find. V opt of equation flUD is not necessarily zero unless f(x) has 
the same sign everywhere. A simple example where V opt ^ is in one dimension with 



/(*) 



sin27rx, which gives V ol 



16. In simple cases it is possible to divide up the 



integration region into parts in which the sign of the integrand does not change, however 
this assumes detailed knowledge of the function and may not be feasible in a high number 
of dimensions. Compare this to the subtraction method, for which V opt = in this case. 

Secondly, there seems to be no automatic method for deciding when VEGAS has con- 
verged to a (near) optimal grid. Usually it is advised to use as few function evaluations 
as possible until the grid approximately converges and then sample on the optimal grid. 
To test this one computes the x 2 statistic. If on each iteration of VEGAS an independent 
estimate of the integral Jj is produced using N' function evaluations; after n iterations the 
variables (Ji —/),... , (J„ — J) will have a normal distribution with mean zero and variance 
cr 2 , assuming that N' is sufficiently large for central limit theorem to apply. Therefore the 
sum of their squares 

7,;-r° 



X 



n 

E 



a 



12 



has a x 2 distribution with n degrees of freedom 



prix 2 



"lp-X 2 /2 



2 n/2 T (|) 
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for x 2 > 0. Sadly, we cannot compute this \ 2 since it requires the exact values of I and a. 
If we use the estimate I ~ / then it is easy to show that the resulting quantity 



still has a \ 2 distribution but now with n — 1 degrees of freedom. However, since a 2 occurs 
in the denominator of \ 2 attempts to replace it with some stochastic estimate will tend to 
give anomalous large values for x 2 or ) to make a more precise statement, the distribution 
will not be a \ 2 distribution but something with fatter tails. 

Finally, we observe that the VEGAS rebinning algorithm will tend to produce small bins 
where the function value is large. If the integrand has a relatively flat but high plateau with 
steep edges this approach will tend to calculate the integral well in the flat region, where 
it is easy, and miss the regions of large variance which require more function evaluations to 
estimate accurately. We will see an explicit example in section 

4 Our Algorithm 

Our algorithm uses two adaptations, subtraction and rebinning, and uses Student's t-test as 
a robust trigger to decide when to apply them. 

As in all such adaptive algorithms, we trigger an adaptation whenever there is statisti- 
cally significant evidence that the current histogram parameters are not optimal, and the 
naive goal is that the histogram parameters will converge to their optimal values thereby 
minimising the variance of our Monte Carlo estimate of the integral. We say "naive" because 
this clearly cannot happen: we are constructing an ergodic Markov process and such a pro- 
cess must converge to a fixed point distribution of histogram parameters which is non-zero 
everywhere — it cannot converge to a single point in the parameter space. Of course, we 
may hope that the Markov process will converge to a distribution of histogram parameters 
strongly peaked about the optimal ones, but how well this may be achieved depends in a 
complicated way upon lots of the algorithmic details and approximations: for example, how 
many samples we have to take in each bin before using the central limit theorem to assert 
that their mean is normally-distributed, or how much we choose to damp the rebinning 
adaptations to try to avoid instabilities. 

We should put such worries in context, it is easy to see that for any adaptive Monte Carlo 
integration scheme there are functions for which they will give an answer with an unreliable 
estimate of its error. A simple example is an approximation to a 5 function within a region 
where the function is zero: not only will Monte Carlo algorithms not "see" the 6 function, 
but adaptive importance sampling will make it less likely for them to do so. 

4.1 Subtraction Adaptation 

We construct an approximation function / adaptively from the Monte Carlo process. Like 
[I] we assume that the integrand may be reasonably approximated by a product of one- 
dimensional histograms with n bins along each axis. This means we only have to accumulate 




n 



2 
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0(nd) rather than 0(n d ) values, and more significantly we can afford to evaluate a reasonably 
large number of samples per bin whereas it would be impossible to evalute even one sample 
per box in practice for n = 10 in d = 10 dimensions for example. However, we stress that 
the this is not a requirement of our method: the idea of using adaptive subtractions as well 
as adaptive importance sampling is independent of the choice of histogram representation. 
Subtraction could be used together with recursive stratified sampling [B] for example, and 
the function approximation stored not as a product of histograms as we will describe below 
but as independent values in each box as required. 

We should also stress that the Monte Carlo algorithm will give the correct value for the 
integral even if the integrand is not well approximated by a product of histograms, or even 
as a product at all. All that happens in such cases is that we have no good reasons to expect 
our adaptations to significantly reduce the statistical error in the result. 

We describe a method closely following VEGAS, assuming that our function is repre- 
sentable as a product of one-dimensional functions. Each axis is divided into a number of 
bins, with a bin along an axis defined to be the Cartesian product of an interval along that 
axis and the unit interval along every other axis. The intersection of d bins, one along each 
axis, will be called a box. 

We generate quasi-random points for the Monte Carlo sampling by choosing a random 
permutation of the bins and a random point in each bin. This ensures that each bin has 
an equal number of samples while still sampling the entire space homogeneously. This 
implements the importance sampling distribution p, as the probability of selecting a sample 
point x within any box b is l/n d , and therefore the probability density p(x) = 1/V& where Vj, 
is the volume of the box b. Initially we choose all the histogram bins to have equal width, 
and therefore all the boxes to have the same volume and therefore p(x) equal everywhere, for 
want of any better information. Likewise, we initially set the subtraction function f(x) = 
everywhere. 

We generate a set {x^\ . . . , x^} of quasi-random points chosen from the distribution p, 
and for each bin (3^ e {1, . . . , n} along each axis p 6 {1, . . . , d} we accumulate the quantities 




for k G {0,1,2}. £o(/3/J is just the number of samples in the bin so with our quasi- 
random number strategy for generating sample points it is just equal to the integer N/n for 
all bins, so we do not really need to accumulate it, although we might if we used a different 
random point generator. We define 

c _ gi(gg) A . f(x) f 

where G [3^ means that x^ falls within that is the p coordinate x^ of the point 
lies in the interval along the p axis. 

If / is a product of one-dimensional functions, f(x) = fi(xi) ■ ■ ■ fd(xd) then 
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moreover, if the average value of / M in the bin (3^ is 
with 1/3^1 = fp dx^ being the bin width, then 



where 



/d „i 
dxf(x) = TT / dx v f u (x v ) 



is the integral we wish to evaluate. We thus may construct an estimate of the value of the 
function f(x) as 

where x E (3 U for z/ e {1, . . . , d}, H = |A| • • • |/3<i| is the volume of the box b containing x, 
and 

f if _ 



i=l v ' 

is the current "global" estimator for the integral. 

We use this function for our subtraction adaptation, that is we take g = f in (jTUJ), and 
evaluate 

I' = Jdx if(x) - /(*)] = | (14) 
by Monte Carlo. The integral of / is known exactly 

d n 



/ 
^) = ^iIIE/a 



V 1 

v=ip v =l 



whence we can use (fTUl) and set I = I' + I" . 



4.2 Rebinning Adaptation 

After making a subtraction as described in section 14.11 the integrand is everywhere approxi- 
mately zero to the best of our current knowledge, so the usual importance sampling analysis 
of section 12.11 does not give us any useful way of choosing the function p. We may therefore 
choose p so as to make the variance constant over all boxes, which in our case corresponds to 
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making it constant for all bins along each axis. We introduce an estimator for the variance 
with the bin /3 M 

p _ SoW (MM » \ f (x)dx (fto-A* 

and our rebinning step adjusts the bin parameters to make the integral of this histogram 
constant within each new bin. An interpolation is performed along each axis to get the value 
of the approximation function in the new bins. The interpolation uses cubic splines to fit the 
current function approximation in the current set of bins: cubic splines perform better for 
some of our test integrals than lower-order interpolation, although there is no intrinsic reason 
to prefer them. When the new set of bins have been found the value of the interpolation 
in the centre of that bin is used for fp . As with importance sampling it is possible that 
the algorithm will rebin too much and narrow the bins very strongly about regions of high 
variance, thus we damp the rebinning algorithm by adding a small constant variance to each 
bin so that no bin can have width zero. The second part of (|14|) tells us how to make a 
subtraction even after changing the importance sampling p: we just keep the old bins to 
evaluate / (perhaps packaging them in a closure) while generating points according to the 
new bins specified by p. The value of / for the next iteration is accumulated in the new 
bins. Interpolation was easier to use in our current implementation and allows us to combine 
estimates from different iterations and avoid starting again after each rebinning. 



4.3 Student Trigger 

If our two adaptations have achieved their goal then the estimates zp of [f(x) — g(x)]/p(x) 
within bin /3 M should have mean zero and equal variance, and if we average enough samples 
within each bin to be able to apply the central limit theorem then they should follow a 
Gaussian distribution. Even though we do not know the exact value of the variance, we can 
test this hypothesis without any further approximations by computing the quantity 

t = -^=- (15) 



/V 

where 



is the average of z over all bins and 



ndind — 1) 

an unbiased estimate of the variance of p z . t has a Student t-distribution with nd — 1 degrees 
of freedom, where Student's distribution with v degrees of freedom is 



1 + 



PAt) v^(f,§) 
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Let k = A~ l (p,nd — 1) where A -1 is the inverse cumulative Student distribution, so 
we expect that —k < t < k with probability p. Therefore, if \t\ > k we may exclude 
the hypothesis that we have Gaussian distributed bin values with mean zero and the same 
variance with probability p and carry out a new adaptation. This should give a more stable 
condition for convergence towards the optimal approximation function. Each axis could have 
a separate trigger, but there does not seem to be any obvious advantage, just as there seems 
to be no reason to have a different number of bins n along each axis. 

The rebinning algorithm is a Markov process that is hopefully converging to the "optimal" 
bin distribution. However, as we stated before, an ergodic Markov process cannot converge to 
a delta function, so our automated rebinning procedure cannot find the optimal integration 
parameters (the bin widths and subtraction values). In practice this manifests itself as an 
instability in the rebinning algorithm. This is not a problem specific to our method but is 
shared by all automatic rebinning schemes; we hope and expect that our use of Student's 
test as a trigger will be more stable that those that use a \ 2 trigger, but to make this into 
a more quantitative statement would require at least a significant restriction on the class of 
integrands under consideration. 

5 Numerical examples 




(a) After adaptive subtraction (b) After adaptive importance sampling 



Figure 2: Rebinning with our algorithm (a) or importance sampling (b) for the integral ilb)) along 
with the function t(x) — t(x) where t is the approximation function adaptively constructed by our 
algorithm. The bin distribution for our algorithm concentrates the bins at the edges where the 
variance is largest. 



To demonstrate the procedure of adaptive subtraction and how it compares to adaptive 
importance sampling, we use the C language implementation of VEGAS importance sampling 
in the GNU Scientific Library (GSL) for comparison [7J. We first choose an example that 
illustrates the strengths of our method, 

/ = J dxt(x) where t = iVtanh(15a;) tanh^!5(l — x) j (16) 
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Subtraction 100 calls per bin 
Subtraction 10 calls per bin 

Importance 1 00 calls per bin ^ 
Importance 1 calls per bin 
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Figure 3: Log of the standard error over the integral evaluation calculated as described in the text 
for equation flTp versus dimension. In one dimension the subtraction method is many orders of 
magnitude better, in more than one dimension the subtraction method is still better than importance 
sampling. For this integrand, having fewer calls per bin is better for both subtraction and importance 
sampling. 
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with iV chosen to make the integral 1=1. The integrand is approximately constant through- 
out most of the integration region except at the very edges where it rises rapidly. Figure [2(a) 
shows the integrand after subtraction with the new bins, chosen such that the variance in 
each is equal. The integration grid is finest in regions where the function is changing rapidly, 
this is as it should be, integrating regions where the integrand is approximately constant 
requires far fewer function evaluations. Compare to part (b) of the figure which shows the 
bin distribution produced by importance sampling. The bin distribution has moved to where 
the function itself is largest. We integrate, 

/ dx x - ■ ■dx d t(x 1 )...t(x d ) (17) 

J [0,1]* 

for various values of d and plot the relative error in Figure [3J This figure shows the log 
of the standard error over the integral approximation using importance sampling and our 
subtraction method. We use 8 x 10 4 evaluations, with the number of bins chosen in both 
cases so that there are 10 or 100 calls per bin. This gives four opportunities to recalculate the 
bin distribution (rebin) where we apply Student's test to decide. In almost every case, with 
90% probability only one rebinning is necessary. We also check the x 2 f° r the importance 
sampling case the same number of times and rebin if the \ 2 per degree of freedom of the 
integral estimates differs from one by more than 0.5. This tends to rebin on every step. We 
could combine all the estimates, weighted by their standard deviation however this sometimes 
underestimates the error. Instead the number we quote comes from 2 x 10 4 calls on the bin 
distribution calculated by the process above. Figure El shows for this example our algorithm 
gives more accurate results. 




Iteration Iteration 

(a) Subtraction Method t-statistic (b) Importance Sampling \x 2 — 1| 

Figure 4: (a) The t statistic h!5\) (b) and \x 2 — 1| kltyl for the integral Ji7| ) in 4 dimensions, using 
subtraction (a) and importance sampling (b). Student's t-test triggers once, the \ 2 test triggers on 
every iteration. 

For the integral ( ITT)) in 4 dimensions we plot the value of the t statistic in figure Ufa), 
compared with k = A~ x (p, n — 1) for p = 0.9 and n = 20, (8,000 calls per iteration, 100 
calls per bin) for a number of iterations. We see that Student's test triggers only on the 
first iteration and then not after, the approximation function is doing as good a job as 
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can be expected. The closest equivalent for importance sampling is the \ 2 P er degree of 
freedom which we show in figure H^b). This is far from 1 at every iteration and the rebinning 
algorithm triggers at every step. We could relax the condition on the \ 2 i e -g-? \x 2 ~ M < 1 
to rebin less often but this example is indicative of both methods, Student's test triggers 
relatively infrequently and the x 2 test triggers more frequently, meaning we are less certain 
when we have found the optimal bin distribution. 
Another test integrand is, 



n 



sin27nr,-. 



18) 



In one way this is an excellent test case since the integrand oscillates and so ([7]) implies 
that even with optimal binning importance sampling cannot reduce the variance to zero. 
However this example shows a distressing feature of our algorithm. As equation (T13T) shows 
when calculating an estimate for the function in each bin we have to divide by the total 
integral, if this is zero there will be problems in d > 1. In practice it is unlikely that we 
have to integrate a function whose integral is exactly zero. The root of this problem is that 
our knowledge of the integrand is represented as a product of histograms, so if the integral 
of the histogram along any axis vanishes we have no information about how the integrand 
behaves along the other axes. 

If the value of the subtraction function in every box were stored this would not be an 
issue. We have tried this approach, which is feasible for small dimensions and small number 
of bins, on this integrand. In 2 dimensions with 20, 000 function evaluations to accumulate 
the function approximation, an additional 20, 000 to evaluate the integral and 25 bins per 
axis we accumulated the approximation in each of the 625 boxes to obtain the estimates 

Naive : -0.00446 ± 0.00354 
Subtraction : 0.00004 ± 0.00040 
Importance : -0.00400 ± 0.00300 

so we do indeed find that subtraction reduces the variance significantly. 
The integral 



dx\ ■ ■ ■ dxd TT \ — exp (- 

[0,H d i V ^ 



-m[Xi 



1\2 



) 2 ) 



erf 



(19) 



is another interesting case. When the parameter m is small, the Gaussian is broad and our 
integration scheme approximates the true value well. However as the dimension increases 
our method fails to find the peak of the Gaussian as well as VEGAS and the integration 
starts to break down. The subtraction method can be improved by increasing the number of 
points per bin, however it still does slightly worse than VEGAS. This is possibly due to the 
interpolation performed during the rebinning step failing to reconstruct the approximation 
function. There is no reason for subtraction to be better in this case and we find, for a fixed 
number of calls as the dimension increases and therefore, the number of bins per axis reduces 
for this function subtraction gets worse more quickly than importance sampling. 

Importance sampling will do better than subtraction in some cases and vice versa there- 
fore combining the two approaches may be beneficial. We can do subtraction for some 
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Subtraction 100 calls per bin -|- 
Subtraclion 1 calls per bin 

Importance 100 calls per bin ^ 

Importance 10 calls per bin |^ 



□ 



Subtraction 1 00 calls per bin -|- 
Subtraction 10 calls per bin 

Importance 1 00 calls per bin ^ 

Importance 1 calls per bin [ 



X x 
□ i 



I s x 



(a) Gaussian to = 10 



(b) Gaussian m — 100 



Figure 5: Log of the standard error over the integral evaluation for equation U9\) versus dimension 
using 4 x 10 points on the grid obtained after four iterations of 4 x 10 4 calls, (a) in = 10 the 
subtraction method works well in low dimensions, (b) m = 100 for this many function calls the 
subtraction method breaks down in dimension greater than 5 the separate integral estimates are 
inconsistent with large x 2 an d the error is large. In contrast, importance sampling works well here. 



8 







0.25 lmp+ 0.75 Sub -4- 






0.25 Imp + 0.75 Sub 


2 




0.25 Sub + 0.75 Imp ^ 


2 




0.25 Sub + 0.75 Imp ^ 














* X * * - 


2 






„ -2 




X * + 












X + + + 


-4 




******** 


-4 


\ 

- + 


^ X + 




X 










6 
8 






-6 
-8 







(a) Gaussian to = 10 (b) Gaussian m = 100 

Figure 6: Log of the standard error over the integral evaluation for equation \19\) versus dimension 
using 4x 10 4 points on the grid obtained after four iterations of Ax 10 calls. For both cases the mixed 
approach works well, retaining the advantage of subtraction in low dimensions and the stability of 
importance sampling in high. Calls per bin = 100. 
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percentage of our total number of function evaluations, generating an approximation func- 
tion and then do importance sampling on / — / for the remainder. Or we could do an initial 
run of importance sampling, then construct the approximation function on the bins with 
width given by p. Figure [6] shows the integral ffT9]) in various dimensions using both varieties 
of mixed sampling. The mixed sampling is done by using 25% of the function evaluations 
on one kind of sampling and then using the grid and approximation function found this 
way for the rest of the calls. The mixed strategies seem to give good results reproducing 
the superiority of subtraction in low dimensions and the stability of importance sampling in 
higher. 

As an interesting practical example we perform the integral derived from the Feynman 
parameterization of the scalar Feynman diagram shown in Figure [7] in 4-dimensional Eu- 
clidean space, and given explicitly in a file attached to this paper. This integral is known to 




Figure 7: </> 3 theory graph evaluated in the text. Explicitly given in attached files. 

converge (its exact value is —7.2123414 . . .) but it is not necessarily square integrable and its 
higher moments may not exist. We plot the distribution obtained from evaluating the inte- 
gral 1,000 times by subtraction and importance sampling. The distribution looks distinctly 
non-Gaussian. It is peaked around the exact value but were the standard deviation used 
to estimate the error it would be incorrect. This kind of problem with parameter integrals 
arises often in practice and we advise caution in interpreting the output of any Monte Carlo 
calculation. 

6 Conclusions 

We have reviewed Monte Carlo integration, pointing out some often overlooked issues when 
the integral exists but the variance or some higher moments do not. This leads to non- 
Gaussian distributions and inaccurate estimates of the error. We have proposed a new 
algorithm for numerical Monte Carlo integration based on adaptive subtraction. We expect 
this method to have an advantage over importance sampling Monte Carlo schemes for some 
integrands. We have implemented our proposed adaptive subtraction method and found it 
to be better than importance sampling in some cases. A combination of the two methods 
can be better than either separately. We have given an explicit program, PANICH which is 
competitive with VEGAS. PANIC also shares the assumption of factorisability into a product 

2 Program for Adaptive Numerical Integration Computations 



17 




Importance Sampling 

Subtraction 



5 10 15 20 25 

F 

Figure 8: Distribution of results for the </> 3 graph. The distribution is non-Gaussian in both cases 
with values differing significantly from the mean occuring far more often than if it were Gaussian 
— the "fat tail" of the distribution. 

of functions with VEGAS, thus improvements to VEGAS that work around this assumption, 
e.g., |8], will also work with subtraction and are an interesting future direction. The 
subtraction idea can work with any Monte Carlo scheme and the subtraction function can 
even be constructed independently, in this paper we have seen good results when combined 
with importance sampling. Our code, PANIC, is included. 
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